Importing relevant libraries

Data Ingestion

Checking the target column distribution

As the countplot above shows, the dataset is highly imbalanced.
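The imbalance can also be quantified rather than read off the plot. The following is a minimal sketch, assuming the target column is called Revenue and using a small made-up DataFrame in place of the real data; both are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's DataFrame (the real data is not shown here).
# The column name "Revenue" is an assumption.
df = pd.DataFrame({"Revenue": [False] * 85 + [True] * 15})

# Absolute counts and relative shares of each target class.
counts = df["Revenue"].value_counts()
shares = df["Revenue"].value_counts(normalize=True)

# Ratio of the majority class to the minority class.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print(f"imbalance ratio: {imbalance_ratio:.2f}")
```

With a strongly skewed target, plain accuracy can look good for a model that almost always predicts the majority class, so the shares are worth keeping in mind when reading the scores later on.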

EDA

November, May, December and March generate more revenue than the other months.

Region 1 has the highest revenue

Operating system 2 has the highest revenue

Browser 2 has the highest revenue

Traffic type 2 has the highest revenue

Returning_Visitor has the highest revenue

Revenue is highest on weekends
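Statements such as "category X has the highest revenue" can refer either to the number of purchasing sessions or to the share of sessions that end in a purchase, and the two can rank categories differently. The sketch below computes both per category. The column names VisitorType and Revenue and the toy rows are assumptions, not the notebook's data.

```python
import pandas as pd

# Toy data (assumption): 6 returning-visitor sessions, 2 new-visitor sessions.
df = pd.DataFrame({
    "VisitorType": ["Returning_Visitor"] * 6 + ["New_Visitor"] * 2,
    "Revenue": [True, True, False, False, False, False, True, False],
})

# Per category: purchasing sessions, all sessions, and the purchase rate.
summary = df.groupby("VisitorType")["Revenue"].agg(
    purchases="sum",
    sessions="count",
    rate="mean",
)

print(summary.sort_values("purchases", ascending=False))
```

In this toy example the returning visitors have more purchases in total, while the new visitors have the higher purchase rate, which is why it helps to state which of the two a plot is showing.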

Observation

Product categories from 0 to 200 are most viewed

Outlier Detection
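The text does not say which outlier rule was used. One common choice is the interquartile range (IQR) rule that box plots are based on, sketched here on made-up values. The helper name iqr_outlier_mask and the factor 1.5 are assumptions.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (series < lower) | (series > upper)


# Made-up values with one obvious extreme point.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])
mask = iqr_outlier_mask(values)

print(values[mask])
```

Whether flagged rows should be dropped, capped or kept depends on the model. Tree-based models are usually less affected by extreme values than distance- or margin-based models.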

Observation

Correlation Analysis

Below, we drop the session id and, guided by the heatmap, keep only one column from each group of highly correlated columns. These columns turn out not to affect the accuracy score: the score is the same whether or not they are kept in the dataset.
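The same selection can be made programmatically instead of by eye. The sketch below drops an identifier column and then, for every pair of columns whose absolute correlation exceeds a threshold, removes the later column. The column names, the synthetic data and the 0.9 threshold are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): "a_copy" is almost a linear function of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "session_id": np.arange(200),
    "a": a,
    "a_copy": a * 2.0 + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

# The identifier carries no predictive information, so remove it first.
features = df.drop(columns=["session_id"])

# Absolute correlations, keeping only the strict upper triangle so that
# each pair of columns is considered once.
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off for "highly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

reduced = features.drop(columns=to_drop)
print("dropped:", to_drop)
print("kept:", list(reduced.columns))
```

A nearly duplicated column adds little new information, which is consistent with the score staying the same with or without it.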

Support Vector Classifier

Visualising confusion matrix
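As a rough illustration of this step, the sketch below fits a support vector classifier and computes its confusion matrix with scikit-learn. The data is synthetic and the scaling step is an assumption, since the preprocessing used here is not described.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (assumption; not the notebook's dataset).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# SVMs are sensitive to feature scale, so scaling is done inside the pipeline.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"false negatives: {fn}, false positives: {fp}")
```

For a plot rather than a printed array, scikit-learn's ConfusionMatrixDisplay can render the same matrix with matplotlib.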

Decision tree with hyperparameter tuning and cross-validation
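One way to combine tuning and cross-validation is a grid search with stratified folds, sketched below. The parameter grid, the F1 scoring choice and the synthetic data are assumptions; the grid actually searched is not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical search space; every combination is evaluated.
param_grid = {
    "max_depth": [2, 4, 6],
    "min_samples_leaf": [1, 5, 20],
}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",  # assumed metric; less misleading than accuracy on skewed classes
    cv=cv,
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("mean cross-validated F1:", search.best_score_)
print("held-out F1:", search.score(X_test, y_test))
```

After fitting, search.best_estimator_ is the tree refit on the full training split with the best parameters, and it is the model that search.predict uses.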

Visualising confusion matrix

Visualising the ROC curve and AUC
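The ROC curve is built from predicted probabilities rather than hard class labels, and the AUC is the area under that curve. A minimal sketch with scikit-learn follows, using synthetic data and an untuned tree as assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# Probability of the positive class for each test row.
scores = clf.predict_proba(X_test)[:, 1]

# False positive rate and true positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(y_test, scores)
roc_auc = roc_auc_score(y_test, scores)

print(f"AUC: {roc_auc:.3f}")
```

Plotting fpr against tpr gives the curve itself. The diagonal from (0, 0) to (1, 1) corresponds to random guessing, with an AUC of 0.5.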

Visualising the fitted Decision tree classifier
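A fitted tree can be inspected as text as well as graphically. The sketch below uses scikit-learn's export_text on a small tree trained on synthetic data. The feature names are placeholders, not the dataset's columns.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with placeholder feature names (assumptions).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# A shallow tree keeps the printed rules readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One line per split or leaf, indented by depth.
tree_text = export_text(clf, feature_names=feature_names)
print(tree_text)
```

For a graphical rendering, sklearn.tree.plot_tree draws the same structure with matplotlib.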

Random Forest with hyperparameter tuning and cross-validation

Visualising confusion matrix

Visualising the ROC curve and AUC

Classifying real test data with the fitted model
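When scoring unseen data, the new rows need the same feature columns, in the same order, as the data the model was trained on. The sketch below shows this with a random forest on synthetic data. The column names f0, f1, f2, session_id and Revenue are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_cols = ["f0", "f1", "f2"]  # hypothetical feature names

# Synthetic training data with a made-up target rule (assumption).
train = pd.DataFrame(rng.normal(size=(200, 3)), columns=feature_cols)
train["Revenue"] = (train["f0"] + 0.5 * train["f1"] > 1.0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train[feature_cols], train["Revenue"])

# Unlabelled rows arriving with a different column order and an extra id column.
new_data = pd.DataFrame(rng.normal(size=(5, 3)), columns=["f2", "f0", "f1"])
new_data["session_id"] = range(5)

# Select the training features in the training order before predicting.
X_new = new_data[feature_cols]
predictions = model.predict(X_new)
probabilities = model.predict_proba(X_new)[:, 1]

print(pd.DataFrame({
    "session_id": new_data["session_id"],
    "predicted": predictions,
    "probability": probabilities,
}))
```

Any preprocessing fitted on the training data, such as scaling or encoding, would also need to be applied to the new rows with the already fitted transformers rather than refitted.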